1 About the course


In this course we will discuss of the Seurat package. Seurat was originally developed as a clustering tool for scRNA-seq data, however in the last few years the focus of the package has become less specific and at the moment Seurat is a popular R package that can perform QC, analysis, and exploration of scRNA-seq data. Although the authors provide several tutorials, here we provide an overview by following an example created by the authors of Seurat. We mostly use default values in various function calls, for more details please consult the documentation and the authors.

This course is directly inspired from http://satijalab.org/seurat/pbmc3k_tutorial.html.

2 Installation / Packages

install.packages('Seurat')
library(Seurat)
install.packages('tidyverse')
library(tidyverse)

3 Seurat - Guided Clustering Tutorial


3.1 Data

For this workshop, we will be analyzing a dataset of Peripheral Blood Mononuclear Cells (PBMC) available from 10x Genomics and downloaded from the Seurat tutorial. There are 2,700 cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here
            https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

Seurat enables easy loading of sparse data matrices provided by 10X Genomics. Read10X takes one parameter data.dir which is the path to the directory containing the matrix.mtx, genes.tsv and barcodes.tsv files provided by 10X (using CellRanger pipeline). 10x Genomics data can alternatively be opened using the cellrangerRkit package - not shown here.

##-- Set working directory
setwd('~/Downloads/')

##-- Load the PBMC dataset
pbmc.data <- Read10X(data.dir = 'Data/filtered_gene_bc_matrices/hg19/')

3.2 Setup seurat object class

Seurat introduces its own object class: seurat. All calculations in this workshop (dimension reduction, differential expression analysis, etc.) are performed using an object of this class.

We first initialize the seurat object with the raw (non-normalized data). We choose to keep, as proposed in the Seurat tutorial, all genes expressed in >= 3 cells (about 0.1% of the data) and all cells with at least 200 detected genes.

##-- Setup Seurat object class
PBMC <- CreateSeuratObject(raw.data = pbmc.data,
                           min.cells = 3, # Include genes with detected expression in at least this many cells (3 cells)
                           min.genes = 200, # Include cells where at least this many genes are detected (200 genes)
                           project = '10x_PBMC')

3.3 Expression Quality Control

The first step in the analysis of a scRNA-seq experiment is quality control (QC).

Seurat allows us to easily explore QC metrics and filter cells based on any user-defined criteria. The number of genes and UMIs (nGene and nUMI) are automatically calculated. For non-UMI data, nUMI represents the sum of the non-normalized values within a cell.

As proposed in the Seurat tutorial, the percentage of mitochondrial genes is calculated and stored in percent.mito using AddMetaData. object@raw.data is used since this represents non-transformed (normalization and log-transformation) counts. As suggested by Ilicic et al. (2016), high mitochondrial gene expression might be suggestive of low-quality libraries. The percent of UMI mapping to mitochondrial genes is then a common scRNA-seq QC metric.

##-- QC metrics - Percentage of mitochondrial genes
#- percent.mito variable
mito.genes <- grep(pattern = '^MT-', x = rownames(x = PBMC@data), value = TRUE) # 13 mitochondrial genes
percent.mito <- Matrix::colSums(PBMC@raw.data[mito.genes, ]) / Matrix::colSums(PBMC@raw.data) # Matrix package is needed
#- Use of AddMetaData to add column to object@meta.data (great place to store QC metrics)
PBMC <- AddMetaData(object = PBMC, 
                    metadata = percent.mito, 
                    col.name = 'percent.mito')

To visualize the QC metrics, there are two functions implemented in Seurat: VlnPlot and GenePlot.

VlnPlot draws a violin plot of the QC metrics.

##-- QC metrics - VlnPlot()
VlnPlot(object = PBMC, 
        features.plot = c('nGene', 'nUMI', 'percent.mito'), 
        nCol = 3)

GenePlot creates a scatter plot of two features. It is usually used to visualize gene-gene relationships, but can be used for anything calculated by the object, i.e. columns in object@meta.data, PC scores, etc.

##-- QC metrics - GenePlot()
par(mfrow = c(1, 2))
GenePlot(object = PBMC, gene1 = 'nUMI', gene2 = 'percent.mito')
GenePlot(object = PBMC, gene1 = 'nUMI', gene2 = 'nGene')

In this example, nUMI and nGene are highly correlated (Pearson’s correlation - 0.95) - it is not always the case. There is a subset of cells with an outlier level of high-mitochondrial percentage and low UMI content: those cells are filtered out. We also notice that some cells with high UMI/gene content compare to the other cells. Those cells might be doublets; they are discarded.
FilterCells creates a seurat object containing only a subset of the cells in the original object. Here, we use the same thresholds as in the Seurat tutorial. Cells that have unique gene counts over 2,500 or less than 200 and more than 5% of mitochondrial genes are filtered out - see below.

##-- Filtering
PBMC_2 <- FilterCells(object = PBMC, 
                      subset.names = c('nGene', 'percent.mito'), 
                      low.thresholds = c(200, -Inf), 
                      high.thresholds = c(2500, 0.05))
# -Inf and Inf should be used if you don't want a lower or upper threshold

3.4 Normalization

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, Seurat employs a global-scaling normalization method LogNormalize that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. More methods will be added very shortly.

##-- Normalization
PBMC_2 <- NormalizeData(object = PBMC_2, 
                        normalization.method = 'LogNormalize', 
                        scale.factor = 10000)

3.5 Detection of highly variable genes

Seurat calculates highly variable genes and focuses on these for downstream analysis. FindVariableGenes calculates the average expression and dispersion for each gene, places these genes into bins, and then calculates a z-score for dispersion within each bin. This helps control for the relationship between variability and average expression. New methods for variable gene expression identification are coming soon. Seurat developers suggest that users set these parameters to mark visual outliers on the dispersion plot, but the exact parameter settings may vary based on the data type, heterogeneity in the sample, and normalization strategy.
The parameters here identify about 2,000 variable genes, and represent typical parameter settings for UMI data that is normalized to a total of 10,000 molecules.

##-- Highly variable genes
PBMC_2 <- FindVariableGenes(object = PBMC_2, 
                            mean.function = ExpMean, 
                            dispersion.function = LogVMR,
                            x.low.cutoff = 0.0125, 
                            x.high.cutoff = 3, 
                            y.cutoff = 0.5)

##-- Highly variable genes
length(PBMC_2@var.genes)
## [1] 1838

3.6 Dealing with confounders

Single-cell data likely contains unwanted sources of variation. This could include not only technical noise, but batch effects, or even biological sources of variation (cell cycle stage). As proposed by Buettner et al. (2015), regressing uninteresting sources of variation can improve downstream analyses. ScaleData constructs linear models to predict gene expression based on user-defined variables, and the resulting residuals are then scaled and centered. The scaled z-scored residuals of these models are stored in object@scale.data, and are used for dimensionality reduction and clustering. Cell-cell variation in gene expression driven by batch, cell alignment rate, the number of detected molecules, and mitochondrial gene expression can be regressed out.
In this example here, we regress on the number of detected molecules per cell (nUMI) as well as the percentage mitochondrial gene content.

##-- Scaling the data and removing unwanted sources of variation - this function takes about 1-2 minutes, might be longer
PBMC_2 <- ScaleData(object = PBMC_2, 
                    model.use = 'linear',
                    vars.to.regress = c('nUMI', 'percent.mito')) # nUMI is used as proxy for Cellular Detection Rate

3.7 Perform linear dimensional reduction

Principal component analysis (PCA) is then performed on the scaled-corrected data. By default, the genes in object@var.genes are used as input, but can be alternatively defined using pc.genes. Running dimensionality reduction on highly variable genes can improve performance. However, according to Seurat authors, with some types of data (UMI) - especially after removing technical effects, PCA returns similar results when run on much larger subsets of genes, including the whole transcriptome.

##-- PCA
PBMC_2 <- RunPCA(object = PBMC_2, 
                 pc.genes = PBMC_2@var.genes, 
                 pcs.compute = 20,
                 do.print = FALSE)

Seurat provides several useful ways of visualizing both cells and genes that define the PCA, including PrintPCA, VizPCA, PCAPlot, and PCHeatmap.

PrintPCA prints a set of genes that most strongly define a set of principal components.

##-- PCA - PrintPCA
PrintPCA(object = PBMC_2, 
         pcs.print = 1:5,
         genes.print = 5) # genes.print is number of genes to print for each PC
## [1] "PC1"
## [1] "CST3"   "TYROBP" "FCN1"   "LST1"   "AIF1"  
## [1] ""
## [1] "PTPRCAP" "IL32"    "LTB"     "CD2"     "CTSW"   
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "NKG7" "GZMB" "PRF1" "CST7" "GZMA"
## [1] ""
## [1] "CD79A"    "MS4A1"    "HLA-DQA1" "TCL1A"    "HLA-DQB1"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "PF4"   "PPBP"  "SDPR"  "SPARC" "GNG11"
## [1] ""
## [1] "CYBA"     "HLA-DPA1" "HLA-DPB1" "HLA-DRB1" "CD37"    
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "IL32"   "GIMAP7" "AQP3"   "FYB"    "MAL"   
## [1] ""
## [1] "CD79A"    "HLA-DQA1" "CD79B"    "MS4A1"    "HLA-DQB1"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "FCER1A"  "LGALS2"  "MS4A6A"  "S100A8"  "CLEC10A"
## [1] ""
## [1] "FCGR3A"        "CTD-2006K23.1" "IFITM3"        "ABI3"         
## [5] "CEBPB"        
## [1] ""
## [1] ""

VizPCA visualizes top genes associated with principal components.

##-- PCA - VizPCA
VizPCA(object = PBMC_2,
       num.genes = 30,
       pcs.use = 1:2)

PCAPlot graphs the output of a PCA analysis.

##-- PCA - PCAPlot
PCAPlot(object = PBMC_2, 
        dim.1 = 1, 
        dim.2 = 2)

3.8 Selection of PCs for visualization and clustering

To overcome the extensive technical noise in any single gene for scRNA-seq data, Seurat uses principal components (PCs) to visualize and cluster cells. Each PC essentially representing a metagene that combines information across a correlated gene set. Determining how many PCs to include downstream is therefore a crucial step.

To determine the number of PCs, there are three main approaches:

  • jackStraw procedure: This is a statistical resampling procedure which is used to essentially construct a null distribution for PC scores and will associate each PC with a p-value to enable the assessment of significance in a more formal statistical framework.

  • Elbow Plot: Plotting the eigenvalues in decreasing order is a useful visual heuristic - the elbow in the plot tends to reflect a transition from informative PCs to those that explain comparatively little variance. This point generally corresponds well with the significant PCs identified via the jackStraw procedure, but is much faster to obtain.

  • Supervised Analysis: Seurat developers suggest that users should explore the PCs chosen for downstream analysis. PCHeatmap can display the extremes accross both genes and cells, and can be useful to help exclude PCs that may be driven primarily by ribosomal/mitochondrial or cell cycle genes.

3.8.1 JackStraw procedure

A resampling test inspired by the jackStraw procedure (Chung and Storey, 2015) is implemented within Seurat. It randomly permutes a subset of the data (1% by default) and re-runs PCA, constructing a null distribution of gene scores, and repeat this procedure (100 by default). It then identifies significant PCs as those who have a string enrichment of low p-value genes.

This process is time-consuming for large datasets, and may not return a clear PC cutoff. There are more approximate techniques that can be used to reduce computation time. Please, do not run this procedure.

##-- JackStraw procedure
PBMC_2 <- JackStraw(object = PBMC_2, 
                    num.pc = 20, 
                    num.replicate = 100, 
                    do.print = FALSE)

JackStrawPlot provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). Significant PCs show a strong enrichment of genes with low p-values (solid curve above the dashed line).
In this case it appears that PCs 1-11 are significant.

##-- JackStraw procedure
JackStrawPlot(object = PBMC_2, PCs = 1:12)

3.8.2 Elbow plot

There is a more ad hoc method for determining which PCs to use: look at a plot of the standard deviations of the PCs and draw your cutoff where there is a clear elbow in the graph. This can be done with PCElbowPlot. This approach is a heuristic that is commonly used, and can be calculated instantly.
In this example, it looks like the elbow would fall around PC 9.

##-- PCElbowPlot
PCElbowPlot(object = PBMC_2,
            num.pc = 20)

3.8.3 Supervised analysis

PCHeatmap can also be useful when trying to decide which PCs to include for further downstream analyses. We recommend you to set cells.use to a number plots the extreme cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. This approach is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. There is no PC driven primarily by ribosomal/mitochondrial or cell cycle genes.

##-- PCHeatmap
PCHeatmap(object = PBMC_2, 
          pc.use = 1:12, 
          cells.use = 500, 
          do.balanced = TRUE, 
          label.columns = FALSE, 
          use.full = FALSE)

To conclude, PC selection is an important step, but can be challenging. In this example, all three approaches yielded similar results: between 5-10 PCs. As proposed in the Seurat tutorial, we select 10 PCs for downstream analyses (visualization and clustering). We recommend you to always use those three approaches to select your PCs.

3.9 Perform non-linear dimensional reduction (tSNE visualization)

To visualize and explore single-cell datasets, Seurat uses t-Distributed Stochastic Neighbor Embedding (tSNE). This is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets such as scRNA-seq (van der Maaten and Hinton, 2008).

As input to the tSNE, we suggest using the top 10 PCs (see previous section) (same PCs as input to the clustering analysis), although computing the tSNE based on scaled gene expression is also supported using the genes.use argument. Here, Barnes-Hut approximations implementation is used, allowing it to be applied on large real-world datasets.

##-- tSNE
PBMC_2 <- RunTSNE(object = PBMC_2, 
                  dims.use = 1:10, 
                  do.fast = TRUE) # do.fast (if TRUE) uses the Barnes-hut implementation, which runs faster, but is less flexible.
##-- tSNE
TSNEPlot(object = PBMC_2)

3.10 Clustering

The next step is to cluster the cells. Many clustering methods have already been developed (Bacher and Kendziorski, 2016).
Seurat implements a graph-based clustering approach. Distances between the cells are calculated based on previously identified PCs (Top 10 here). This approach is heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data: SNN-Cliq (Xu and Su, 2015) and CyTOF data - PhenoGraph (Levine et al., 2015). Briefly, these methods embed cells in a graph structure (e.g. K-neareast neighbor (KNN) graph), with edges drawn between cells with similar gene expression patterns, and then attempt to partition this graph into highly interconnected quasi-cliques or communities.
Seurat first constructs a KNN graph based on the euclidean distance in PCA space, and refines the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard distance). To cluster the cells, it applies modularity optimization techniques (Blondel et al., 2008), to iteratively group cells together, with the goal of optimizing the standard modularity function. Modularity is designed to measure how well a network is divided into modules.

    

    

FindClusters implements the procedure, and contains a resolution parameter that sets the granularity of the downstream clustering, with increased values leading to a greater number of clusters. Seurat authors found that setting this parameter between 0.6-1.2 typically returns good results for single cell datasets of around 3,000 cells. Optimal resolution often increases for larger datasets. The clusters are saved in the object@ident slot.

Cells within the graph-based clusters determined above should co-localize on the tSNE plot. This is because the tSNE aims to place cells with similar local neighborhoods in high-dimensional space together in low-dimensional space.
Here, as proposed in the Seurat tutorial, we set a resolution of 0.6 and a number of neighbors of 30 (by default).

##-- Clustering
PBMC_2 <- FindClusters(object = PBMC_2, 
                       reduction.type = 'pca', 
                       dims.use = 1:10, 
                       resolution = 0.6, 
                       k.param = 30,
                       print.output = 0, 
                       save.SNN = TRUE) # save.SNN = T saves the SNN so that the clustering algorithm can be rerun using the same graph but with a different resolution value (see docs for full details)

Seurat implements a really useful feature that is the ability to recall the parameters that were used in the latest function calls for commonly used functions. For FindClusters, they provide PrintFindClustersParams to print a nicely formatted summary of the parameters that were chosen.

##-- Clustering
PrintFindClustersParams(object = PBMC_2)
## Parameters used in latest FindClusters calculation run on: 2018-01-03 19:38:29
## =============================================================================
## Resolution: 0.6
## -----------------------------------------------------------------------------
## Modularity Function    Algorithm         n.start         n.iter
##      1                   1                 100             10
## -----------------------------------------------------------------------------
## Reduction used          k.param          k.scale          prune.SNN
##      pca                 30                25              0.0667
## -----------------------------------------------------------------------------
## Dims used in calculation
## =============================================================================
## 1 2 3 4 5 6 7 8 9 10

There are eight clusters.

##-- tSNE - Clustering
TSNEPlot(object = PBMC_2)

Given that this clustering step is quite long to run, we recommend you to save the seurat object at this point so that it can easily be loaded back in without having to re-run the computationally intensive steps performed above, or easily shared with collaborators.

##-- Save
saveRDS(PBMC_2, file = 'PBMC_tutorial.rds')

3.11 Finding differentially expressed genes and cluster biomarkers

Seurat is able to find biomarkers that define clusters via differential expression. By default, FindMarkers identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.

The min.pct argument requires a gene to be detected at a minimum percentage in either of the two groups of cells, and the logfc.threshold argument requires a gene to be differentially expressed (on average) by some amount between the two groups. You can set both of these to 0, but with a dramatic increase in time - since this will test a large number of genes that are unlikely to be highly discriminatory. As another option to speed up these computations, max.cells.per.ident can be set. This will downsample each identity class to have no more cells than whatever this is set to. While there is generally going to be a loss in power, the speed increases can be significiant and the most highly differentially expressed genes will likely still rise to the top.

Seurat has nine tests for differential expression which can be set with the test.use argument: Wilcoxon test (‘wilcox’ - by default), Likelihood-ratio test for single cell gene expression (‘bimod’), ROC test (‘roc’), Student’s t-test (‘t’), LRT test based on tobit-censoring models (‘tobit’), Likelihood ratio test assuming an underlying poisson distribution (‘poisson’ - use only for UMI-based datasets), Likelihood ratio test assuming an underlying negative binomial distribution (‘negbinom’ - use only for UMI-based datasets), GLM-framework that treates cellular detection rate (CDR) as a covariate (‘MAST’) and DE based on a model using the negative binomial distribution (‘DESeq2’).
Here, we choose to use MAST (Finak et al., 2015). MAST proposes a hurdle model tailored to the analysis of scRNA-seq data. A logistic regression model is used to test differential expression rate between groups, while a Gaussian generalized linear model (GLM) describes expression conditionally on non-zero expression estimates. The model also includes the CDR as a covariate to correct for biological and technical nuisance factors that can affect the number of genes detected in a cell (e.g. cell size and amplification bias).

##-- Marker genes
#- Find all markers of cluster 1
cluster1.markers <- FindMarkers(object = PBMC_2, 
                                ident.1 = 1, 
                                min.pct = 0.25,
                                test.use = 'MAST',
                                latent.vars = 'nUMI') # nUMI as proxy for CDR
##-- Marker genes
#- Find all markers of cluster 1 - Top 5
print(x = head(x = cluster1.markers, n = 5)) 
##        p_val avg_logFC pct.1 pct.2 p_val_adj
## S100A9     0  3.827593 0.996 0.216         0
## S100A8     0  3.786535 0.973 0.123         0
## LYZ        0  3.116197 1.000 0.517         0
## LGALS2     0  2.634722 0.908 0.060         0
## FCN1       0  2.369524 0.956 0.150         0
##-- Marker genes
#- Find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(object = PBMC_2, 
                                ident.1 = 5, 
                                ident.2 = c(0, 3),
                                min.pct = 0.25,
                                test.use = 'MAST',
                                latent.vars = 'nUMI') # nUMI as proxy for CDR
print(x = head(x = cluster5.markers, n = 5))
##-- Marker genes
#- Find all markers distinguishing cluster 5 from clusters 0 and 3 - Top 5
print(x = head(x = cluster5.markers, n = 5))
##                p_val  avg_logFC pct.1 pct.2     p_val_adj
## GNLY   3.155255e-163  3.5147178 0.961 0.143 4.327116e-159
## RPS12  1.370564e-151 -1.0195287 0.994 1.000 1.879592e-147
## GZMB   4.490788e-151  3.1950212 0.955 0.084 6.158666e-147
## RPS27  1.206599e-143 -0.9156105 0.987 0.999 1.654730e-139
## TYROBP 1.041801e-133  2.3629902 0.903 0.140 1.428726e-129

The code below takes some time - about 5 minutes here.

##-- Marker genes
#- Find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(object = PBMC_2, 
                               only.pos = TRUE,
                               min.pct = 0.25,
                               logfc.threshold = 0.25, # log-FC
                               return.thresh = 0.01, # Only return markers that have a p-value < 0.01
                               test.use = 'MAST',
                               latent.vars = 'nUMI') # nUMI as proxy for CDR
pbmc.markers %>% 
  group_by(cluster) %>% 
  top_n(2, avg_logFC)
##-- Marker genes
#- Find markers for every cluster compared to all remaining cells, report only the positive ones - Top 2 for each cluster
pbmc.markers %>% 
  group_by(cluster) %>% 
  top_n(2, avg_logFC)

Seurat implements several tools for visualizing marker expression. VlnPlot (shows expression probability distributions across clusters), and FeaturePlot (visualizes gene expression on a tSNE or PCA plot) are the most commonly used visualizations.

##-- Marker genes - Visualization - VlnPlot
VlnPlot(object = PBMC_2, 
        features.plot = c('PPBP', 'CD79A')) # You can also plot it using UMI with use.raw = TRUE

##-- Marker genes - Visualization - FeaturePlot
FeaturePlot(object = PBMC_2, 
            features.plot = c('MS4A1', 'GNLY', 'CD3E', 'CD14', 'FCER1A', 'FCGR3A', 'LYZ', 'PPBP', 'CD8A'), 
            cols.use = c('grey', 'blue'), 
            reduction.use = 'tsne')

DoHeatmap generates an expression heatmap for given cells and genes. In this case, we are plotting the top 10 markers for each cluster.

##-- Marker genes - Visualization - DoHeatmap
top10 <- pbmc.markers %>% 
  group_by(cluster) %>% 
  top_n(10, avg_logFC) 
DoHeatmap(object = PBMC_2, 
          genes.use = top10$gene, 
          slim.col.label = TRUE, 
          remove.key = TRUE)

3.12 Assigning cell type identity to clusters

Cluster biomarkers are used to define cell type identity to clusters. In this dataset, we have fortunately access to well-known canonical markers that can be used to easily match the unbiased clustering to known cell types:

  • IL7R - CD4 T cells - Cluster 0;

  • CD14, LYZ - CD14+ Monocytes - Cluster 1;

  • MS4A1 (CD20) - B cells - Cluster 2;

  • CD8A - CD8 T cells - Cluster 3;

  • FCGR3A (CD16), MS4A7 - CD16+ Monocytes - Cluster 4;

  • GNLY, NKG7 - NK cells - Cluster 5;

  • FCER1A, CST3 - Dendritic cells - Cluster 6;

  • PPBP - Megakaryocytes - Cluster 7.

We can then assign cell type identity to clusters.

##-- Assigning cell type identity to clusters
current.cluster.ids <- c(0, 1, 2, 3, 4, 5, 6, 7)
new.cluster.ids <- c('CD4 T cells', 'CD14+ Monocytes', 'B cells', 'CD8 T cells', 'CD16+ Monocytes', 'NK cells', 'Dendritic cells', 'Megakaryocytes')
PBMC_2@ident <- plyr::mapvalues(x = PBMC_2@ident, 
                                 from = current.cluster.ids, 
                                 to = new.cluster.ids)
##-- Assigning cell type identity to clusters
TSNEPlot(object = PBMC_2, 
         do.label = TRUE,
         label.size = 4)

3.13 Subdivisions within cell types

As explained in the clustering section (see above), if you perturb some of your parameters (such as resolution or changing the number of PCs), you won’t have exactly the same number of clusters.
For example, in this case, when we set resolution = 0.8, we have now nine clusters. The CD4 T cells cluster is divided into two sub-clusters.

##-- Clustering
PBMC_2 <- FindClusters(object = PBMC_2, 
                       reduction.type = 'pca', 
                       dims.use = 1:10, 
                       resolution = 0.8, 
                       k.param = 30,
                       print.output = 0, 
                       save.SNN = TRUE) # save.SNN = T saves the SNN so that the clustering algorithm can be rerun using the same graph but with a different resolution value (see docs for full details)
##-- tSNE - Clustering
TSNEPlot(object = PBMC_2)

##-- Marker genes
#- Find markers distinguishing cluster 0 from cluster 1
tcells.markers <- FindMarkers(object = PBMC_2, 
                              ident.1 = 0,
                              ident.2 = 1,
                              min.pct = 0.25,
                              test.use = 'MAST',
                              latent.vars = 'nUMI') # nUMI as proxy for CDR
##-- Marker genes
#- Find markers distinguishing cluster 0 from cluster 1 - Top 10
print(x = head(x = tcells.markers, n = 10))
##                p_val  avg_logFC pct.1 pct.2    p_val_adj
## B2M     6.692617e-36 -0.2751629 1.000 1.000 9.178255e-32
## S100A4  8.170458e-35 -0.6316306 0.695 0.889 1.120497e-30
## IL32    1.011383e-30 -0.5327842 0.758 0.925 1.387011e-26
## RPS3A   7.665023e-25  0.2869979 1.000 0.993 1.051181e-20
## CLIC1   1.173398e-22 -0.6116589 0.338 0.618 1.609198e-18
## ACTB    1.975907e-21 -0.3481958 0.990 0.993 2.709759e-17
## SRGN    1.491794e-19 -0.6425341 0.338 0.585 2.045846e-15
## HLA-A   1.007978e-18 -0.2703460 0.985 0.986 1.382341e-14
## HLA-C   1.105000e-18 -0.2784078 0.985 0.996 1.515397e-14
## S100A11 1.800522e-18 -0.5813084 0.305 0.542 2.469236e-14

Many markers tend to be expressed in Cluster 1 (i.e. S100A4). However, we do observe that CCR7 is up-regulated in Cluster 0 (CCR7 being a marker of memoy T cells). It strongly indicates that we are able to differentiate memory from naive CD4 cells.

##-- Marker genes - Visualization - FeaturePlot
FeaturePlot(object = PBMC_2, 
            features.plot = c('S100A4', 'CCR7'), 
            cols.use = c('grey', 'blue'), 
            reduction.use = 'tsne')

The memory/naive split is bit weak, and we would probably benefit from looking at more cells to see if this becomes more convincing.

4 sessionInfo()

## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.13.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] bindrcpp_0.2        forcats_0.2.0       stringr_1.2.0      
##  [4] dplyr_0.7.4         purrr_0.2.4         readr_1.1.1        
##  [7] tidyr_0.7.2         tibble_1.3.4        tidyverse_1.2.1    
## [10] Seurat_2.1.0        Biobase_2.36.2      BiocGenerics_0.22.1
## [13] Matrix_1.2-12       cowplot_0.9.1       ggplot2_2.2.1      
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.0.0         backports_1.1.2      Hmisc_4.0-3         
##   [4] VGAM_1.0-4           NMF_0.20.6           sn_1.5-1            
##   [7] plyr_1.8.4           igraph_1.1.2         lazyeval_0.2.1      
##  [10] splines_3.4.0        gridBase_0.4-7       digest_0.6.13       
##  [13] foreach_1.4.4        htmltools_0.3.6      lars_1.2            
##  [16] gdata_2.18.0         magrittr_1.5         checkmate_1.8.5     
##  [19] cluster_2.0.6        doParallel_1.0.11    mixtools_1.1.0      
##  [22] ROCR_1.0-7           sfsmisc_1.1-1        recipes_0.1.1       
##  [25] modelr_0.1.1         gower_0.1.2          dimRed_0.1.0        
##  [28] R.utils_2.6.0        colorspace_1.3-2     rvest_0.3.2         
##  [31] haven_1.1.0          crayon_1.3.4         jsonlite_1.5        
##  [34] bindr_0.1            survival_2.41-3      iterators_1.0.9     
##  [37] ape_5.0              glue_1.2.0           DRR_0.0.2           
##  [40] registry_0.5         gtable_0.2.0         ipred_0.9-6         
##  [43] kernlab_0.9-25       ddalpha_1.3.1        prabclus_2.2-6      
##  [46] DEoptimR_1.0-8       scales_0.5.0         mvtnorm_1.0-6       
##  [49] rngtools_1.2.4       Rcpp_0.12.14         dtw_1.18-1          
##  [52] xtable_1.8-2         htmlTable_1.11.0     tclust_1.3-1        
##  [55] foreign_0.8-69       proxy_0.4-20         mclust_5.4          
##  [58] SDMTools_1.1-221     Formula_1.2-2        stats4_3.4.0        
##  [61] tsne_0.1-3           lava_1.5.1           prodlim_1.6.1       
##  [64] httr_1.3.1           htmlwidgets_0.9      FNN_1.1             
##  [67] gplots_3.0.1         RColorBrewer_1.1-2   fpc_2.1-10          
##  [70] acepack_1.4.1        modeltools_0.2-21    ica_1.0-1           
##  [73] pkgconfig_2.0.1      R.methodsS3_1.7.1    flexmix_2.3-14      
##  [76] nnet_7.3-12          caret_6.0-78         labeling_0.3        
##  [79] tidyselect_0.2.3     rlang_0.1.4          reshape2_1.4.3      
##  [82] cellranger_1.1.0     munsell_0.4.3        tools_3.4.0         
##  [85] cli_1.0.0            ranger_0.8.0         broom_0.4.3         
##  [88] ggridges_0.4.1       evaluate_0.10.1      yaml_2.1.16         
##  [91] ModelMetrics_1.1.0   knitr_1.17           robustbase_0.92-8   
##  [94] caTools_1.17.1       pbapply_1.3-3        nlme_3.1-131        
##  [97] R.oo_1.21.0          RcppRoll_0.2.2       xml2_1.1.1          
## [100] compiler_3.4.0       rstudioapi_0.7       ggjoy_0.4.0         
## [103] stringi_1.1.6        lattice_0.20-35      trimcluster_0.1-2   
## [106] psych_1.7.8          diffusionMap_1.1-0   data.table_1.10.4-3 
## [109] bitops_1.0-6         irlba_2.3.1          R6_2.2.2            
## [112] latticeExtra_0.6-28  KernSmooth_2.23-15   gridExtra_2.3       
## [115] codetools_0.2-15     MASS_7.3-47          gtools_3.5.0        
## [118] assertthat_0.2.0     CVST_0.2-1           pkgmaker_0.22       
## [121] rprojroot_1.2        withr_2.1.0          mnormt_1.5-5        
## [124] hms_0.4.0            diptest_0.75-7       grid_3.4.0          
## [127] rpart_4.1-11         timeDate_3042.101    class_7.3-14        
## [130] rmarkdown_1.8        segmented_0.5-3.0    Rtsne_0.13          
## [133] numDeriv_2016.8-1    scatterplot3d_0.3-40 lubridate_1.7.1     
## [136] base64enc_0.1-3

5 References

Bacher and Kendziorski, Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol. 2016; 17:63.

Blondel et al., Fast unfolding of communities in large networks. Paper available at: arXiv:0803.0476v1 2008

Buettner et al., Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015; 33(2):155-60.

Chung and Storey, Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics 2015; 31(4):545-54.

Finak et al., MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 2015; 16:278.

Ilicic et al., Classification of low quality cells from single-cell RNA-seq data. Genome Biol. 2016; 17:29.

Levine et al., Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 2015; 62(1):184-97.

Satija et al., Spatial reconstruction of single-cell gene expression data. Nat Biotechnol. 2015; 33(5):495-502.

van der Maaten and Hinton, Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 2008; 9(Nov):2579-2605.

Xu and Su, Identification f cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 2015; 31:1974-80.